{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "08301c9f",
   "metadata": {},
   "source": [
    "# 🚀 DeepSeek R1 × WaterCrawl — Live Chat with Any Website URL\n",
    "\n",
    "Welcome to this **step‑by‑step Jupyter Notebook tutorial**. By the end, you’ll be able to:\n",
    "1. **Crawl & clean** any webpage with **WaterCrawl**.\n",
    "2. **Summarize** the scraped content with **DeepSeek R1**.\n",
    "3. **Chat interactively** about the page — all within a single Python class.\n",
    "\n",
    "> **Important:** This demo *does not* store data in a vector database or perform similarity‑based retrieval. Everything happens on‑the‑fly in memory — perfect for quick analyses, but not for large‑scale production RAG pipelines."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55a59efb",
   "metadata": {},
   "source": [
    "## 🗝️ Key Features\n",
    "- **One‑click crawling**: extract clean Markdown from any URL.\n",
    "- **AI summaries**: condense long pages into factual bullet points.\n",
    "- **Context‑aware answers**: ask follow‑up questions that cite page content.\n",
    "- **Multi‑site sessions**: load several pages and keep chatting about them."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e42a0b0",
   "metadata": {},
   "source": [
    "## ⚙️ Setup & Installation\n",
    "```bash\n",
    "# In a fresh environment ⬇️\n",
    "pip install watercrawl-py deepseek-ai python-dotenv rich requests\n",
    "```\n",
    "Create a `.env` file in the same folder:\n",
    "```env\n",
    "WATERCRAWL_API_KEY=\"your_watercrawl_key\"\n",
    "DEEPSEEK_API_KEY=\"your_deepseek_key\"\n",
    "```\n",
    "*(Free WaterCrawl keys are available at [app.watercrawl.dev](https://app.watercrawl.dev). DeepSeek R1 keys can be requested from their team.)*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "43a051ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install watercrawl-py deepseek-ai python-dotenv rich requests\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "cd627958",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ▶️ Run this cell after adding your keys to .env\n",
    "from dotenv import load_dotenv\n",
    "import os\n",
    "load_dotenv()\n",
    "\n",
    "WATERCARWL_KEY = os.getenv(\"WATERCRAWL_API_KEY\")\n",
    "DEEPSEEK_KEY   = os.getenv(\"DEEPSEEK_API_KEY\")\n",
    "\n",
    "assert WATERCARWL_KEY and DEEPSEEK_KEY, \"🔑 Add your API keys to the .env file first!\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1660b40f",
   "metadata": {},
   "source": [
    "## 🤖 DeepSeek R1 Lightweight Client\n",
    "A minimal wrapper around the HTTP API — no extra dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "f2659b1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests, logging\n",
    "class DeepSeekClient:\n",
    "    \"\"\"Tiny helper for DeepSeek chat completions.\"\"\"\n",
    "    def __init__(self, api_key, base_url=\"https://api.deepseek.com\"):\n",
    "        self.session = requests.Session()\n",
    "        self.session.headers.update({\n",
    "            \"Authorization\": f\"Bearer {api_key}\",\n",
    "            \"Content-Type\": \"application/json\"\n",
    "        })\n",
    "        self.base_url = base_url\n",
    "\n",
    "    def chat(self, messages, model=\"deepseek-reasoner\", **kwargs):\n",
    "        url = f\"{self.base_url}/v1/chat/completions\"\n",
    "        payload = {\"model\": model, \"messages\": messages, **kwargs}\n",
    "        resp = self.session.post(url, json=payload, timeout=60)\n",
    "        resp.raise_for_status()\n",
    "        return resp.json()"
   ]
  },
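  {
   "cell_type": "markdown",
   "id": "a7c1e2f4",
   "metadata": {},
   "source": [
    "The client returns the raw JSON response. As a rough sketch of how that response might be unpacked, the helper below pulls out the answer text and, when present, the `reasoning_content` field that DeepSeek documents for `deepseek-reasoner`. The name `extract_reply` and the sample dictionary are illustrative assumptions, not part of this notebook or of any DeepSeek SDK; the sample is made-up data shaped like a chat completions response, so no API call is made.\n",
    "\n",
    "```python\n",
    "from typing import Any, Dict, Optional, Tuple\n",
    "\n",
    "def extract_reply(response: Dict[str, Any]) -> Tuple[str, Optional[str]]:\n",
    "    '''Return (answer, reasoning) from a chat-completions style response.'''\n",
    "    message = response['choices'][0]['message']\n",
    "    # reasoning_content is optional; responses without it yield None\n",
    "    return message.get('content', ''), message.get('reasoning_content')\n",
    "\n",
    "# Made-up response, used only to illustrate the expected shape\n",
    "sample = {\n",
    "    'choices': [\n",
    "        {'message': {'role': 'assistant',\n",
    "                     'content': 'Hello!',\n",
    "                     'reasoning_content': 'The user greeted me.'}}\n",
    "    ]\n",
    "}\n",
    "answer, reasoning = extract_reply(sample)\n",
    "print(answer)\n",
    "```\n",
    "\n",
    "Keeping the parsing in one place would mean `summarize` and `ask` do not each need to index into `choices` by hand."
   ]
  },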
  {
   "cell_type": "markdown",
   "id": "4c67eee0",
   "metadata": {},
   "source": [
    "## 📝 Logging Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "f6abcf49",
   "metadata": {},
   "outputs": [],
   "source": [
    "logging.basicConfig(level=logging.INFO,\n",
    "                    format='%(asctime)s - %(levelname)s - %(message)s')\n",
    "logger = logging.getLogger(\"LiveChat\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d31e1085",
   "metadata": {},
   "source": [
    "## 💬 Chat Message Schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "7d3572e7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from enum import Enum\n",
    "from dataclasses import dataclass, field\n",
    "from datetime import datetime\n",
    "from typing import Dict, Any, List\n",
    "\n",
    "class MessageType(Enum):\n",
    "    SYSTEM=\"system\"; USER=\"user\"; ASSISTANT=\"assistant\"; WEBSITE=\"website\"\n",
    "\n",
    "@dataclass\n",
    "class ChatMessage:\n",
    "    role: MessageType\n",
    "    content: str\n",
    "    timestamp: datetime = field(default_factory=datetime.now)\n",
    "    metadata: Dict[str, Any] = field(default_factory=dict)"
   ]
  },
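  {
   "cell_type": "markdown",
   "id": "b3d9f016",
   "metadata": {},
   "source": [
    "The schema includes a `WEBSITE` role, which is not one of the roles a chat completions endpoint accepts. One possible way to turn stored messages into an API payload is sketched below. The helper name `to_api_messages` and the choice to send `WEBSITE` content as a `system` message are assumptions made for illustration; the two classes are repeated only so the block runs on its own.\n",
    "\n",
    "```python\n",
    "from dataclasses import dataclass, field\n",
    "from datetime import datetime\n",
    "from enum import Enum\n",
    "from typing import Any, Dict, List\n",
    "\n",
    "# Repeated from the cell above so this block is self-contained\n",
    "class MessageType(Enum):\n",
    "    SYSTEM = 'system'\n",
    "    USER = 'user'\n",
    "    ASSISTANT = 'assistant'\n",
    "    WEBSITE = 'website'\n",
    "\n",
    "@dataclass\n",
    "class ChatMessage:\n",
    "    role: MessageType\n",
    "    content: str\n",
    "    timestamp: datetime = field(default_factory=datetime.now)\n",
    "    metadata: Dict[str, Any] = field(default_factory=dict)\n",
    "\n",
    "def to_api_messages(history: List[ChatMessage]) -> List[Dict[str, str]]:\n",
    "    '''Hypothetical helper: convert ChatMessage objects to API payload dicts.'''\n",
    "    converted = []\n",
    "    for msg in history:\n",
    "        # Assumption: WEBSITE content is passed to the model as system context\n",
    "        role = 'system' if msg.role is MessageType.WEBSITE else msg.role.value\n",
    "        converted.append({'role': role, 'content': msg.content})\n",
    "    return converted\n",
    "\n",
    "history = [\n",
    "    ChatMessage(MessageType.SYSTEM, 'You are a helpful assistant.'),\n",
    "    ChatMessage(MessageType.WEBSITE, 'Page text goes here.'),\n",
    "    ChatMessage(MessageType.USER, 'What is this page about?'),\n",
    "]\n",
    "api_messages = to_api_messages(history)\n",
    "print(api_messages)\n",
    "```\n",
    "\n",
    "Timestamps and metadata stay in the local history and are dropped from the payload, since the API only needs `role` and `content`."
   ]
  },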
  {
   "cell_type": "markdown",
   "id": "3e630a76",
   "metadata": {},
   "source": [
    "## 🤖 WebsiteChatBot Class\n",
    "Handles crawling, summarization, and conversation flow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "5525d906",
   "metadata": {},
   "outputs": [],
   "source": [
    "from watercrawl import WaterCrawlAPIClient\n",
    "from rich.progress import Progress, SpinnerColumn, TextColumn\n",
    "\n",
    "class WebsiteChatBot:\n",
    "    def __init__(self, watercrawl_key, deepseek_key):\n",
    "        self.watercrawl = WaterCrawlAPIClient(api_key=watercrawl_key)\n",
    "        self.deepseek  = DeepSeekClient(api_key=deepseek_key)\n",
    "        self.chat_history: List[ChatMessage] = []\n",
    "        self.website_content = {}\n",
    "        self._system(\"You are a helpful AI assistant…\")\n",
    "\n",
    "    # ——— Utility helpers ———\n",
    "    def _system(self, text):\n",
    "        self.chat_history.append(ChatMessage(MessageType.SYSTEM, text))\n",
    "    def _user(self, text):\n",
    "        self.chat_history.append(ChatMessage(MessageType.USER, text))\n",
    "    def _assistant(self, text):\n",
    "        self.chat_history.append(ChatMessage(MessageType.ASSISTANT, text))\n",
    "\n",
    "    # ——— Crawling ———\n",
    "    def crawl(self, url):\n",
    "        result = self.watercrawl.scrape_url(url=url,\n",
    "            page_options={\"exclude_tags\":[\"nav\",\"footer\",\"header\"],\n",
    "                          \"include_tags\":[\"article\",\"main\",\"section\"],\n",
    "                          \"include_html\":False})\n",
    "        markdown = result['result']['markdown']\n",
    "        self.website_content[url] = markdown\n",
    "        return markdown\n",
    "\n",
    "    # ——— Summarize ———\n",
    "    def summarize(self, content):\n",
    "        messages=[{\"role\":\"system\",\"content\":\"Summarize this page.\"},\n",
    "                  {\"role\":\"user\",\"content\":content[:4000]}]\n",
    "        resp = self.deepseek.chat(messages, temperature=0.3)\n",
    "        return resp['choices'][0]['message']['content']\n",
    "\n",
    "    # ——— Q&A ———\n",
    "    def ask(self, question):\n",
    "        ctx = \"\\n\\n\".join(self.website_content.values())[:5000]\n",
    "        messages=[{\"role\":\"system\",\"content\":\"Context:\"+ctx},\n",
    "                  {\"role\":\"user\",\"content\":question}]\n",
    "        resp = self.deepseek.chat(messages, max_tokens=800)\n",
    "        answer = resp['choices'][0]['message']['content']\n",
    "        self._assistant(answer)\n",
    "        return answer"
   ]
  },
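  {
   "cell_type": "markdown",
   "id": "c5e8a127",
   "metadata": {},
   "source": [
    "Because `ask` joins all crawled pages and then truncates the result, a long first page can crowd later pages out of the context. A minimal sketch of one alternative is shown below: each page gets an equal share of the character budget and is labeled with its URL. The helper `build_context`, its `budget` parameter, and the example URLs are hypothetical and not part of the class above.\n",
    "\n",
    "```python\n",
    "from typing import Dict\n",
    "\n",
    "def build_context(pages: Dict[str, str], budget: int = 5000) -> str:\n",
    "    '''Hypothetical helper: give every crawled page an equal slice of the budget.'''\n",
    "    if not pages:\n",
    "        return ''\n",
    "    # The budget counts page text only; the URL labels are added on top\n",
    "    per_page = budget // len(pages)\n",
    "    parts = []\n",
    "    for url, markdown in pages.items():\n",
    "        # Label each slice with its URL so the source of the text is visible\n",
    "        parts.append(f'Source: {url}\\n{markdown[:per_page]}')\n",
    "    return '\\n\\n'.join(parts)\n",
    "\n",
    "# Placeholder pages standing in for self.website_content\n",
    "pages = {\n",
    "    'https://example.com/a': 'A' * 10000,\n",
    "    'https://example.com/b': 'B' * 100,\n",
    "}\n",
    "context = build_context(pages, budget=200)\n",
    "print(context)\n",
    "```\n",
    "\n",
    "An equal split is the simplest policy; it can waste budget when one page is short, which is one reason retrieval-based approaches scale better."
   ]
  },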
  {
   "cell_type": "markdown",
   "id": "28dd1a15",
   "metadata": {},
   "source": [
    "## 🔍 Quick Demo "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5fb58ae0",
   "metadata": {},
   "outputs": [],
   "source": [
    "bot = WebsiteChatBot(WATERCARWL_KEY, DEEPSEEK_KEY)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "0fef5891",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "**Summary of the Page:**\n",
      "\n",
      "The page highlights **WaterCrawl v0.7.1**, a web crawling tool update focused on smarter search capabilities, Google Custom Search integration, real-time status tracking, and transparent credit management. Released on May 3, 2025, by Amir Asaran, it emphasizes transforming web content into structured data for LLM training and analysis.\n",
      "\n",
      "**Featured Articles:**  \n",
      "1. **AI Communication Protocols**: Alireza Mofidi explains MCP (Model Context Protocol) and A2A (Agent-to-Agent Protocol), frameworks enabling AI systems to interact contextually and autonomously.  \n",
      "2. **LLM Serving Frameworks**: A guide comparing on-premises deployment tools like vLLM and TGI for optimizing GPU performance and scalability, also by Alireza Mofidi.  \n",
      "\n",
      "**Key Features of WaterCrawl:**  \n",
      "- **Smart Crawling**: Control depth, domains, and paths for targeted extraction.  \n",
      "- **Precision Extraction**: Customizable selectors to filter unwanted content.  \n",
      "- **AI Integration**: OpenAI-powered processing for structured data output.  \n",
      "- **Extensible Plugins**: Customizable workflows via plugins.  \n",
      "- **JavaScript Rendering**: Captures dynamic content and screenshots.  \n",
      "- **Open Source**: Transparent and customizable under an open-source model.  \n",
      "\n",
      "The page also promotes a **Playground Interface** for testing crawlers and invites users to explore use cases like content analysis and LLM training. Featured articles and resources (blog, documentation, GitHub) are highlighted, along with community links (Discord, newsletter).\n"
     ]
    }
   ],
   "source": [
    "page = bot.crawl(\"https://watercrawl.dev/\")\n",
    "print(bot.summarize(page))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d03bfdb2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"The main topic of the page is **transforming web content into structured, LLM-ready data using WaterCrawl**. It positions WaterCrawl as a tool designed to:\\n\\n1. **Convert websites into knowledge bases** for:\\n   - Training LLMs (Large Language Models)\\n   - Content analysis\\n   - Data-driven applications\\n\\n2. **Provide web crawling infrastructure** with features like:\\n   - Smart content extraction (filtering ads/noise)\\n   - AI-powered processing (OpenAI integration)\\n   - JavaScript rendering\\n   - Customizable plugins\\n   - Integration with popular AI/LLM stacks (Dify, n8n, Langchain, etc.)\\n\\nThe page emphasizes WaterCrawl's role in bridging raw web content and AI applications, particularly highlighting its utility for developers and teams working with LLMs. Supporting articles about AI protocols (MCP/A2A) and LLM serving frameworks further contextualize its focus on AI/LLM infrastructure.\""
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bot.ask(\"What is the main topic of the page?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fbe64cfe",
   "metadata": {},
   "source": [
    "## 🚦 Next Steps\n",
    "- Plug this notebook into LangChain or LlamaIndex for vector‑based retrieval.\n",
    "- Cache crawled pages locally to avoid re‑scraping.\n",
    "- Add citation snippets to answers for greater transparency.\n",
    "\n",
    "---\n",
    "© 2025 – Feel free to remix and share!"
   ]
  }
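  ,
  {
   "cell_type": "markdown",
   "id": "d2f4b638",
   "metadata": {},
   "source": [
    "As one way to approach the caching step listed above, the sketch below stores each page's Markdown in a file named after a hash of its URL and only calls the fetch function on a cache miss. The helper `cached_crawl`, the `fake_fetch` stand-in, and the temporary cache directory are all assumptions made for illustration; nothing here is part of WaterCrawl.\n",
    "\n",
    "```python\n",
    "import hashlib\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "from typing import Callable\n",
    "\n",
    "def cached_crawl(url: str, fetch: Callable[[str], str], cache_dir: Path) -> str:\n",
    "    '''Hypothetical helper: return cached Markdown for url, fetching only on a miss.'''\n",
    "    cache_dir.mkdir(parents=True, exist_ok=True)\n",
    "    # Hash the URL to get a filesystem-safe file name\n",
    "    key = hashlib.sha256(url.encode('utf-8')).hexdigest()\n",
    "    path = cache_dir / f'{key}.md'\n",
    "    if path.exists():\n",
    "        return path.read_text(encoding='utf-8')\n",
    "    markdown = fetch(url)\n",
    "    path.write_text(markdown, encoding='utf-8')\n",
    "    return markdown\n",
    "\n",
    "# Stand-in for bot.crawl so the example runs without network access\n",
    "calls = []\n",
    "def fake_fetch(url: str) -> str:\n",
    "    calls.append(url)\n",
    "    return f'# Content of {url}'\n",
    "\n",
    "cache = Path(tempfile.mkdtemp())\n",
    "first = cached_crawl('https://example.com', fake_fetch, cache)\n",
    "second = cached_crawl('https://example.com', fake_fetch, cache)\n",
    "print(len(calls))\n",
    "```\n",
    "\n",
    "In this notebook, `bot.crawl` could be passed as `fetch`. The sketch never expires entries, so a real cache would also need some way to refresh stale pages."
   ]
  }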
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
